System and method for pick-and-drop sampling

ABSTRACT

A database system includes an input to a database server configured to deliver a data stream formed of a sequence of elements, D={p₁, p₂, . . . , p_(m)} of size m of numbers from {1, . . . , n} to the database server. The system further includes a computer program that causes a processor to approximate frequency moments (F_(k)) in the data stream, such that a frequency of an element (i) is defined as f_(i)=|{j:p_(j)=i}| and a k-th frequency moment of D is defined as

$F_{k} = {\sum\limits_{i = 1}^{n}m_{i}^{k}}$

in a single pass through the data stream. The processor is caused to carry out the steps of locating elements (i) with a frequency ΩF_(k) in the data stream as heavy elements and approximating f_(i) as ≧ a fraction of f_(i) to limit memory resources used by the processor to estimate F_(k) to O(n^(1−2/k) log(n)) bits.

BACKGROUND OF THE INVENTION

The present invention relates generally to systems and methods for signal processing. More specifically, the present invention relates to a system and method for estimating frequency moments in data stream processing algorithms.

In many signal processing applications, the signal must be processed in a few or, often, just one pass. For example, a “data stream” is often referred to as a sequence of data that is too large to be stored, in its entirety, in memory. Such data streams are common in communications network traffic, database transactions, satellite data feeds, and the like. In such instances, “streaming algorithms” are used to process these signals as data streams forming an input presented as a sequence of items that can be examined in very few passes. A common example of a streaming algorithm is one developed to count the number of distinct elements in a data stream. The continuous nature of the underlying signal and the resource constraints that limit the amount of repetitive processing performed generally result in the algorithm producing an approximate answer based on a summary of the data stream that is stored.

Alon, Matias, and Szegedy, in The Space Complexity of Approximating the Frequency Moments, Journal of Computer and System Sciences, 58:137-147, 1999, which is incorporated herein by reference in its entirety, approached such a signal processing problem, within the context of database processing, and introduced the concept of “frequency moments.” Namely, for a sequence of elements, D={p₁, p₂, . . . , p_(m)} of size m of numbers from {1, . . . , n}, a frequency of an element, i, is defined as:

f _(i) =|{j:p _(j) =i}|  Eqn. 1.

The k-th frequency moment of D is defined as:

$\begin{matrix}{F_{k} = {\sum\limits_{i = 1}^{n}{m_{i}^{k}.}}} & {{Eqn}.\mspace{14mu} 2}\end{matrix}$
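By way of illustration only, the definitions of Eqns. 1 and 2 can be sketched as an exact, non-streaming computation. The function name `frequency_moment` and the sample stream are hypothetical and are not part of the disclosure. The sketch stores every distinct element, which is exactly the memory cost that a streaming approximation seeks to avoid.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact k-th frequency moment: sum of (frequency ** k) over distinct items."""
    counts = Counter(stream)  # counts[i] is f_i = |{j : p_j = i}|
    return sum(f ** k for f in counts.values())


# Hypothetical stream in which 1 appears three times, 2 once, and 3 twice.
stream = [1, 3, 1, 2, 1, 3]
print(frequency_moment(stream, 0))  # number of distinct elements
print(frequency_moment(stream, 1))  # length m of the stream
print(frequency_moment(stream, 3))  # third frequency moment
```

For k=0 the sum counts distinct elements and for k=1 it returns the stream length, which matches the usual reading of F₀ and F₁.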

Alon, Matias, and Szegedy, when approaching the problem of approximating frequency moments in one pass over D and using sublinear space, observed a striking difference between “small” and “large” values of k. Specifically, it is possible to approximate F_(k) for k≦2 in polylogarithmic space. However, polynomial space is required when k>2. Since the work of Alon, Matias, and Szegedy in the late 1990's, approximating F_(k) has become one of the most inspiring problems in the theory of data streams.

For example, many have focused on efficient algorithms for estimating particular moments, such as F₂, which is useful for computing statistical properties of the data. Others have focused on bounding the memory required of F_(k) approximation algorithms. For example, many proposed solutions or bounds having accuracy up to a polylogarithmic factor. However, as noted above, since a polynomial space is required when k>2, suitably efficient approximations for frequency moments for k≧3 have been lacking.

It would therefore be desirable to provide a system and method for approximating frequency moments with a lower space complexity than traditionally dictated for k≧3.

SUMMARY OF THE INVENTION

The present invention overcomes the aforementioned drawbacks by providing a method of non-uniform sampling to find frequency moments in a data stream.

In accordance with one aspect of the invention, a database system includes a database and a database server configured to control reading data from and writing data to the database. The system also includes an input to the database server configured to deliver a data stream formed of a sequence of elements, D={p₁, p₂, . . . , p_(m)} of size m of numbers from {1, . . . , n} to the database server. The system further includes a non-transitive, computer-readable storage medium, having stored thereon, a computer program that, when executed by a processor, causes the processor to approximate frequency moments (F_(k)) in the data stream, such that a frequency of an element (i) is defined as f_(i)=|{j:p_(j)=i}| and a k-th frequency moment of D is defined as

$F_{k} = {\sum\limits_{i = 1}^{n}m_{i}^{k}}$

in a single pass through the data stream. The processor is caused to carry out the steps of (a) arranging a portion of the data stream in a matrix, (b) selecting an initial element in the matrix, and (c) checking the matrix for a duplicate of the initial element. Upon identifying a duplicate of the initial element in the matrix, the processor is caused to carry out the step of (d) assuming that the initial element appears in each row of the matrix, assigning binary values to all other frequencies, and disregarding the initial element. Upon completing step (c) without identifying a duplicate of the initial element, the processor is caused to carry out the step of (e) assigning a binary value to all frequencies. Steps (b) through (e) are repeated as step (f) for each subsequent element in the matrix and step (g) is performed by generating a report of approximated frequency moments in the data stream.

In accordance with another aspect of the invention, a method for approximating frequency moments (F_(k)) in data streams formed of a sequence of elements, D={p₁, p₂, . . . , p_(m)} of size m of numbers from {1, . . . , n} is disclosed, such that a frequency of an element (i) is defined as f_(i)=|{j:p_(j)=i}| and a k-th frequency moment of D is defined as

$F_{k} = {\sum\limits_{i = 1}^{n}{m_{i}^{k}.}}$

The method includes the steps of (a) arranging a portion of the data stream in a matrix, (b) selecting an initial element in the matrix, and (c) checking the matrix for a duplicate of the initial element. The method also includes (d) upon identifying a duplicate of the initial element in the matrix, assuming that the initial element appears in each row of the matrix, assigning binary values to all other frequencies, and disregarding the initial element. The method further includes (e) upon completing step (c) without identifying a duplicate of the initial element, assigning a binary value to all frequencies. The method additionally includes (f) repeating steps (b) through (e) for each subsequent element in the matrix and (g) generating a report of approximated heavy elements.

In accordance with another aspect of the present invention, a database system is disclosed that includes a database and a database server configured to control reading data from and writing data to the database. The system also includes an input to the database server configured to deliver a data stream formed of a sequence of elements, D={p₁, p₂, . . . , p_(m)} of size m of numbers from {1, . . . , n} to the database server. The system further includes a non-transitive, computer-readable storage medium, having stored thereon, a computer program that, when executed by a processor, causes the processor to approximate frequency moments (F_(k)) in the data stream, such that a frequency of an element (i) is defined as f_(i)=|{j:p_(j)=i}| and a k-th frequency moment of D is defined as

$F_{k} = {\sum\limits_{i = 1}^{n}m_{i}^{k}}$

in a single pass through the data stream. The processor is caused to carry out the steps of locating elements (i) with a frequency ΩF_(k) in the data stream as heavy elements and approximating f_(i) as ≧ a fraction of f_(i) to limit memory resources used by the processor to estimate F_(k) to O(n^(1−2/k) log(n)) bits.

The foregoing and other aspects and advantages of the invention will appear from the following description. In the description, reference is made to the accompanying drawings which form a part hereof, and in which there is shown by way of illustration a preferred embodiment of the invention. Such embodiment does not necessarily represent the full scope of the invention, however, and reference is made therefore to the claims and herein for interpreting the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a system for use with the present invention.

FIG. 2 is a flow chart setting forth the steps of an exemplary method in accordance with the present invention and for use with systems, such as illustrated in FIG. 1.

DETAILED DESCRIPTION OF THE INVENTION

1. Introduction.

Referring to FIG. 1, a system for implementing an algorithm in accordance with the present invention is illustrated. As shown in FIG. 1, the system 10 includes a control program 12 and a memory 14 which, for example, may be resident in a database server 16 supporting database 18. Every data item to be input to the database 18 is input through a data input channel 20 where the data item is processed by the control program 12.

As will be described, when a command is received from a computer system 22 accessing the system 10 over a control channel 24 to provide an estimate of a frequency moment, the control program 12 is invoked to generate an estimated frequency moment for output on output channel 26. To this point, the present invention provides a system and method for approximating frequency moments in insertion-only data streams for k≧3. For any constant, c, the present invention can show an O(n^(1−2/k) log(n) log^((c))(n)) upper bound on the space complexity of the problem. Here log^((c))(n) is the iterative log function. To simplify the presentation, the following assumptions can be made:

n and m are polynomially far; and

approximation error ε and parameter k are constants.
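For illustration, the iterative log function appearing in the space bound above can be sketched as follows, under the assumption that log^((c))(n) denotes the logarithm applied c times in succession and that base 2 is used. The names `iterated_log` and `space_bound_bits` are hypothetical, and the bound is evaluated with all constant factors omitted.

```python
import math


def iterated_log(n, c):
    """Apply the base-2 logarithm c times to n (assumed meaning of log^(c)(n))."""
    value = float(n)
    for _ in range(c):
        value = math.log2(value)
    return value


def space_bound_bits(n, k, c):
    """Evaluate n^(1-2/k) * log(n) * log^(c)(n), ignoring constant factors."""
    return n ** (1.0 - 2.0 / k) * math.log2(n) * iterated_log(n, c)


print(iterated_log(2 ** 16, 1), iterated_log(2 ** 16, 2), iterated_log(2 ** 16, 3))
print(space_bound_bits(2 ** 16, 3, 2))
```

Because the iterated logarithm shrinks very quickly with c, the additional factor is small compared with the n^(1−2/k) log(n) term for any practical n.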

A natural bijection between streams and special matrices can be observed. As described hereafter, the present invention provides a non-uniform sampling method referred to herein as “a pick-and-drop sampling.”

To illustrate a pick-and-drop method in accordance with the present invention, an example can be utilized where m=r*t and r=[n^(1/k)]. In this context and referring to FIG. 2, the following example of a sampling method 200 in accordance with the present invention is described. At process block 202, the data stream is arranged into a matrix. Arranging the data stream in the matrix may be achieved via a logical arrangement. By way of the present non-limiting example, consider an r×t matrix, M, with entries m_(ij)=p_(t(i−1)+j). For m≦n, the following promise problem can be solved with probability ⅔:

case 1: all frequencies are either zero or one; and

case 2: z appears in every row of M exactly once and, thus, f_(z)=r, and all other frequencies are either zero or one.

At process block 204, an element is selected for analysis. The element may be selected at random. Specifically, r independent and identically distributed (i.i.d.) random numbers I₁, . . . , I_(r), can be picked, where I_(i) is uniformly distributed on {1,2, . . . , t}. At decision block 206, for a given element (and, as will be described, each i=1 . . . r−1), a check for a duplicate of m_(i,I_(i)) in the row i+1 is made. If the duplicate is found, in this example, then we output “case 2” by assuming that z appears in every row of M exactly once. More generally, at process block 208, the i-th sample is “dropped,” and, at process block 210, another element is selected, for example, to “pick” the (i+1)-th sample. This process is repeated T times independently.

Returning to decision block 206, if no duplicate is found, at decision block 212, a check is made to determine whether the end of the matrix has been reached. If not, another element is selected at process block 210. Once the end of the matrix is reached, at process block 212, a report is generated, for example, to report the “heavy elements.” Heavy elements are the elements that appear often in the data stream. Frequency moments are a function of the data stream. As will be described, the present invention provides a system and method for determining a frequency moment using heavy elements, but the heavy elements can be used for other and additional purposes.

By way of the present example, “case 1” is output in the report if no duplicate is found. Note, if the input represents case 1, the method will always output “case 1.” Consider case 2 and observe that, if m_(i,I_(i))=z, then case 2 will be output. Indeed, since z appears in every row, the duplicate of z will be found. The probability to miss z entirely is:
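The promise-problem test described above can be sketched as follows. This is a simplified illustration only: the function name `detect_case` is hypothetical, and the whole matrix is held in memory for clarity, whereas a streaming implementation would retain only the current sample while scanning the next row.

```python
import random


def detect_case(matrix, trials, rng=None):
    """Return "case 2" if a sampled entry of row i reappears in row i+1, else "case 1"."""
    rng = rng or random.Random()
    r = len(matrix)
    for _ in range(trials):  # T independent repetitions
        for i in range(r - 1):
            row = matrix[i]
            sample = row[rng.randrange(len(row))]  # m_{i, I_i} with I_i uniform
            if sample in matrix[i + 1]:  # duplicate found in row i+1
                return "case 2"
    return "case 1"


# Hypothetical inputs: all-distinct entries, and a matrix whose single column holds z=7.
print(detect_case([[1, 2, 3], [4, 5, 6], [7, 8, 9]], trials=5))
print(detect_case([[7], [7], [7]], trials=1))
```

When every frequency is zero or one, no entry can reappear in another row, so the sketch never reports case 2 on a case 1 input; errors are one-sided, as noted in the text.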

$\begin{matrix}{\left( {1 - \frac{1}{t}} \right)^{r\; T}.} & {{Eqn}.\mspace{14mu} 3}\end{matrix}$

Recall that m≦n, m=rt, r=[n^(1/k)]. If T=O(n^(1−2/k)) with a sufficiently large constant, then the probability of error with respect to eqn. 3 is smaller than ⅓. Accordingly, the promise problem can be resolved with O(n^(1−2/k) log(n)) space. Notably, the solution depends upon r. Thus, in general, it is prudent to carefully select the matrix.

Unfortunately, the distribution of the frequent element in the stream can be arbitrary. Also, the algorithm should desirably recognize “noisy” frequencies that are large but negligible. Clearly, the sampling must be more intricate but, fortunately, it need not be rendered greatly more complex.

Accordingly, still referring to FIG. 2, counters can be used. Specifically, a local counter can be introduced for each sample that counts the number of times the sample appears in the suffix of the i-th row. Notably, such a counting method was used by Alon, Matias, and Szegedy, in The Space Complexity of Approximating the Frequency Moments, for the entire stream. In contrast, in the present invention it is contemplated that a global sample (and a global counter), as functions of the local samples and counters, may be used. Initially, at process block 216 the global sample is the local sample of the first row, and the global counter is incremented at process block 218 only when the local counter indicates such.

Notably, under certain conditions, the global sample can be “dropped.” If this is the case, then the local sample of the current row is “picked” and becomes the new global sample. The global sample is “dropped” when the local counter exceeds the global one. Also, the global sample is dropped if the global counter does not grow fast enough. Growth is measured against a function, λq, where λ is a parameter and q is the number of rows that the global counter survived. If the global counter is smaller than λq, then the global sample is “dropped.”
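As a numerical illustration of eqn. 3, the following sketch evaluates the miss probability for one hypothetical choice of parameters. The function name `miss_probability` and the specific values (n=10⁶, k=3, so that r=100, t=10,000, and T=2·n^(1−2/k)=200) are assumptions chosen for the example.

```python
def miss_probability(t, r, T):
    """Probability (1 - 1/t)^(r*T) that none of the r*T uniform samples hits z."""
    return (1.0 - 1.0 / t) ** (r * T)


# Hypothetical parameters: n = 10**6, k = 3, r = 100 rows, t = 10**4 columns, T = 200.
p = miss_probability(t=10_000, r=100, T=200)
print(p, p < 1.0 / 3.0)
```

Since (1−1/t)^(rT) is approximately exp(−rT/t), the miss probability in this example is close to exp(−2), and any constant multiple of n^(1−2/k) that keeps rT/t above roughly 1.1 already drives the error below ⅓.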

2. Pick-and-Drop Sampling

Let M be a matrix with r rows and t columns and with entries m_(i,j) ε[n]. For i ε [r], j ε [t], l ε [n] define:

$\begin{matrix}{{d_{i,j} = {\left\{ {{{j^{\prime}\text{:}j} \leq j^{\prime} \leq t},{m_{i,j^{\prime}} = m_{i,j}}} \right\} }};} & {{Eqn}.\mspace{14mu} 4} \\{{f_{l,i} = {\left\{ {{j \in {\left\{ t \right\} \text{:}m_{i,j}}} = l} \right\} }};} & {{Eqn}.\mspace{14mu} 5} \\{{f_{l} = {\left\{ {{\left( {i,j} \right)\text{:}m_{i,j}} = l} \right\} }};} & {{Eqn}.\mspace{14mu} 6} \\{{F_{k} = {\sum\limits_{l = 1}^{n}f_{l}^{k}}},{G_{k} = {F_{k} - {f_{1}^{k}.}}}} & {{Eqn}.\mspace{14mu} 7}\end{matrix}$
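The quantities of Eqns. 4 through 7 can be sketched directly for a small matrix. The function names below are hypothetical, row and column indices are 0-based (unlike the 1-based indices of the text), and each set in the definitions is read as its cardinality.

```python
def suffix_count(M, i, j):
    """d_{i,j}: occurrences of M[i][j] in row i from column j to the end of the row."""
    return sum(1 for x in M[i][j:] if x == M[i][j])


def row_frequency(M, l, i):
    """f_{l,i}: number of occurrences of l in row i."""
    return M[i].count(l)


def frequency(M, l):
    """f_l: number of occurrences of l in the whole matrix."""
    return sum(row.count(l) for row in M)


def moment(M, k, n):
    """F_k: sum of f_l ** k over l = 1..n (intended for k >= 1)."""
    return sum(frequency(M, l) ** k for l in range(1, n + 1))


def residual_moment(M, k, n):
    """G_k = F_k - f_1 ** k, the moment contributed by elements other than 1."""
    return moment(M, k, n) - frequency(M, 1) ** k


M = [[1, 2, 1], [3, 1, 2]]  # hypothetical 2x3 matrix over {1, 2, 3}
print(suffix_count(M, 0, 0), row_frequency(M, 1, 0), frequency(M, 1))
print(moment(M, 2, 3), residual_moment(M, 2, 3))
```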

Note that there is a bijection between r×t matrices M and streams D of size r×t with elements p_(it+j)=m_(i,j), where the definitions with respect to eqns. 1, 2 and 6, 7 define equivalent frequency vectors for a matrix and the corresponding stream. Without loss of generality, consider streams of size r×t for some r,t. The notions of a stream and its corresponding matrix can be interchanged.

Let {I_(j)}_(j=1)^(r) be i.i.d. random variables with uniform distributions on [t]. Define for i=1, . . . , r:
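The bijection between streams and matrices can be sketched as a pair of inverse helpers. The names `stream_to_matrix` and `matrix_to_stream` are hypothetical; with 0-based indices, entry (i, j) of the matrix is element i·t+j of the stream.

```python
def stream_to_matrix(stream, r, t):
    """Arrange a stream of length r*t into r rows of t consecutive elements."""
    if len(stream) != r * t:
        raise ValueError("stream length must equal r * t")
    return [list(stream[i * t:(i + 1) * t]) for i in range(r)]


def matrix_to_stream(M):
    """Read the matrix row by row to recover the stream."""
    return [x for row in M for x in row]


stream = [5, 1, 4, 1, 2, 1]  # hypothetical stream with r=2, t=3
M = stream_to_matrix(stream, 2, 3)
print(M)
print(matrix_to_stream(M) == stream)
```

Because the arrangement is purely logical, a streaming implementation need only track the current row and column position rather than materialize the matrix.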

s_(i)=m_(i,I) _(i) , c_(i)=d_(i,I) _(i)   Eqn. 8.

Let λ be a parameter. Define the following recurrent random variables:

S₁=s₁, C₁=c₁, q₁=1   Eqn. 9.

Also (for i=2, . . . r), if:

(C_(i−1)<max{λq_(i−1),c_(i)})   Eqn. 10;

then define:

S_(i)=s_(i), C_(i)=c_(i), q_(i)=1   Eqn. 11;

otherwise, define:

S _(i) =S _(i−1) , C _(i) =C _(i−1) +f _(S) _(i) _(,i) , q _(i) =q_(i−1)+1   Eqn. 12.
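The recurrence of eqns. 8 through 12 can be sketched as a single pass over the rows. This is a minimal sketch under stated assumptions: the function name `pick_and_drop` is hypothetical, the matrix is held in memory for clarity, and column indices are 0-based.

```python
import random


def pick_and_drop(matrix, lam, rng=None):
    """Return the final global sample S_r, counter C_r, and age q_r."""
    rng = rng or random.Random()
    S = C = q = None
    for i, row in enumerate(matrix):
        I = rng.randrange(len(row))  # I_i, uniform over the columns
        s = row[I]  # local sample s_i = m_{i, I_i}
        c = sum(1 for x in row[I:] if x == s)  # local counter c_i = d_{i, I_i}
        if i == 0 or C < max(lam * q, c):
            # Eqns. 9 and 11: drop the global sample and pick the local one.
            S, C, q = s, c, 1
        else:
            # Eqn. 12: keep the global sample and add its frequency in this row.
            C += sum(1 for x in row if x == S)
            q += 1
    return S, C, q


# Hypothetical input in which element 1 fills the matrix, so it must be the final sample.
print(pick_and_drop([[1, 1, 1, 1]] * 3, lam=1, rng=random.Random(0)))
```

In the sketch the test of eqn. 10 is evaluated once per row, so the state carried between rows is only the triple (S, C, q) plus the local sample and counter of the current row.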

Therefore, Theorem 2.1 states: If M is an r×t matrix, there exist absolute constants α,β such that, if:

$\begin{matrix}{{{\alpha \left( {{\lambda \; r} + \frac{G_{3}}{\lambda \; t} + \frac{G_{2}}{t}} \right)} \leq f_{1} \leq {\beta \; t}};} & {{Eqn}.\mspace{14mu} 13}\end{matrix}$

then:

$\begin{matrix}{{P\left( {S_{r} = 1} \right)} \geq {\frac{f_{1}}{2\; t}.}} & {{Eqn}.\mspace{14mu} 14}\end{matrix}$

Proof. Denote Q={(i, j):m_(i,j)=1}. For (i, j) ε Q, define:

$\begin{matrix}{T_{i,j} = \overline{\left( A_{i,j} \cup B_{i,j} \cup H_{i,j} \right)};} & {{Eqn}.\mspace{14mu} 15}\end{matrix}$

where for i>1:

A _(i,j)=((C _(i−1) ≧d _(i,j))∩(S _(i−1)≠1))   Eqn. 16;

for i<r:

$\begin{matrix}{{B_{i,j}\left( {\underset{h = {i + 1}}{\bigcup\limits^{r}}\left( {{d_{i,j} + {\sum\limits_{u = {i + 1}}^{h - 1}f_{1,u}}} < c_{h}} \right)} \right)};} & {{Eqn}.\mspace{14mu} 17} \\{{H_{i,j}\left( {\overset{r}{\bigcup\limits_{h = {i + 1}}}\left( {{d_{i,j} + {\sum\limits_{u = {i + 1}}^{h - 1}f_{1,u}}} < {\left( {h - 1} \right)\lambda}} \right)} \right)};} & {{Eqn}.\mspace{14mu} 18}\end{matrix}$

and A_(1,j)=B_(r,j)=H_(r,j)=∅. We have:

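$\begin{matrix}{\left( \left( s_{i} = 1 \right) \cap \left( S_{i - 1} \neq 1 \right) \cap \overline{A_{i,I_{i}}} \right) \subseteq \left( \left( s_{i} = 1 \right) \cap \left( C_{i - 1} < c_{i} \right) \right) \subseteq \left( \left( S_{i} = 1 \right) \cap \left( q_{i} = 1 \right) \right).} & {{Eqn}.\mspace{14mu} 19}\end{matrix}$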

Consider the case when S_(i)=1 and q_(i)=1 and

${d_{i,I_{i}} + {\sum\limits_{u = {i + 1}}^{h - 1}f_{1,u}}} \geq {{\max \left( {{\lambda \left( {h - i} \right)},c_{h}} \right)}\mspace{14mu} {for}\mspace{14mu} {all}\mspace{14mu} h} > {i.}$

In this case S_(h) will be defined by eqn. 12 and not by eqn. 11. In particular, S_(h)=S_(i)=1. Therefore:

$\begin{matrix}\left( {{\left( {S_{i} = 1} \right)\bigcap\left( {q_{i} = 1} \right)\bigcap\overset{\_}{B_{i,I_{i}}}\bigcap\overset{\_}{H_{i,I_{i}}}} \subseteq {\left( {\overset{r}{\bigcap\limits_{h = i}}\left( {S_{h} = 1} \right)} \right).}} \right. & {{Eqn}.\mspace{14mu} 20}\end{matrix}$

Define V₁=((s₁=1)∩T_(1,I₁)) and, for i>1, V_(i)=((s_(i)=1)∩(S_(i−1)≠1)∩T_(i,I_(i))). It follows from eqns. 19 and 20 that, for any i ε [r]:

V _(i) ⊂(S _(r)=1)   Eqn. 21;

V_(i)∩V_(j)=∅ for i≠j   Eqn. 22.

Thus:

$\begin{matrix}{{\sum\limits_{i = 1}^{r}{P\left( {V_{i}\overset{r}{\bigcup\limits_{i = 1}}V_{i}} \right)}} \leq {{P\left( {S_{r} = 1} \right)}.}} & {{Eqn}.\mspace{14mu} 23}\end{matrix}$

For any i>1, P(V_(i))≧P((s_(i)=1)∩T_(i,I_(i)))−P(s_(i)=S_(i−1)=1). Also,

$\sum\limits_{i = 2}^{r}P\left( s_{i} = S_{i - 1} = 1 \right) \leq \sum\limits_{i = 2}^{r}P\left( \left( s_{i} = 1 \right) \cap \left( \bigcup\limits_{h \neq i}\left( s_{h} = 1 \right) \right) \right) \leq \left( \sum\limits_{i = 1}^{r}P\left( s_{i} = 1 \right) \right)^{2} = \left( \frac{f_{1}}{t} \right)^{2}.$

For any fixed (i, j) ε Q, events I_(i)=j and T_(i,j) are independent. Indeed, A_(i,j) is defined by {S_(i−1), C_(i−1)} that, in turn, is defined by {I₁, . . . , I_(i−1)}. Similarly, B_(i,j) is defined by {I_(i+1), . . . , I_(r)}. Note that H_(i,j) is a deterministic event. By definition, {I₁, . . . , I_(i−1), I_(i+1), . . . , I_(r)} are independent of I_(i). Thus, the event I_(i)=j and the event T_(i,j), which is determined by A_(i,j), B_(i,j), and H_(i,j), are independent. Thus:

$\begin{matrix}\begin{matrix}{\sum\limits_{i = 1}^{r}P\left( \left( s_{i} = 1 \right) \cap T_{i,I_{i}} \right) = \sum\limits_{{(i,j)} \in Q}P\left( \left( I_{i} = j \right) \cap T_{i,j} \right)} \\ {= \sum\limits_{{(i,j)} \in Q}P\left( I_{i} = j \right)P\left( T_{i,j} \right)} \\ {= \frac{1}{t}\sum\limits_{{(i,j)} \in Q}P\left( T_{i,j} \right).}\end{matrix} & {{Eqn}.\mspace{14mu} 24}\end{matrix}$

Thus,

${P\left( {S = 1} \right)} \geq {{\frac{1}{t}{\sum\limits_{{({i,j})} \in Q}{P\left( T_{i,j} \right)}}} - {\left( \frac{f_{1}}{t} \right)^{2}.}}$

As will be clear, Lemma 2.2 implies that Σ_((i,j)εQ)P(T_(i,j))≧0.8f₁.Thus, if β<0.3, then

${{P\left( {S_{r} = 1} \right)}\frac{f_{1}}{t}\left( {0.8 - \frac{f_{1}}{t}} \right)} \geq {\frac{f_{1}}{2t}.}$

Here, only the second part of eqn. 13 was used. The first part is usedin the proof of Lemma 2.2.

Lemma 2.2. There exists absolute constants α,β such that eqn. 13 implies

${\sum\limits_{{({i,j})} \in Q}{P\left( T_{i,j} \right)}} > {0.8\mspace{14mu} {f_{1}.}}$

It follows from Lemmas 2.9, 2.14, and 2.17 and the union bound that there exist at least 0.97f₁ pairs (i, j) ε Q, such that P(A_(i,j)∪B_(i,j)∪H_(i,j))≦0.02. Since T_(i,j) is the complement of the event (A_(i,j)∪B_(i,j)∪H_(i,j)), P(T_(i,j))≧0.98 for each such pair, and the lemma follows.

Events of type A.

For (i, j) ε Q such that i>1 and for l>1, define:

${Y_{l,{({i,j})}} = {1_{A_{i,j}}1_{({S_{i - 1} = l})}}},{Y_{l,i} = {\sum\limits_{{j \in {\lbrack t\rbrack}},{{({i,j})} \in Q}}Y_{l,{({i,j})}}}},{Y_{l} = {\sum\limits_{i = 2}^{r}Y_{l,i}}},{Y = {\sum\limits_{l = 2}^{n}Y_{l}}},$

Fact 2.3. C_(i)≦f_(S_(i)). Also, if q_(i)=1, then C_(i)≦f_(S_(i),i).

Proof. It follows directly from eqns. 11 and 12 that it is sufficient to prove that, for any i, there exists a set Q_(i), such that C_(i)≦|Q_(i)| and, simultaneously, Q_(i) is a subset of {(i′,j):m_(i′,j)=S_(i), i′≦i}. Through induction on i, the above claim can be proven. For i=1, the claim is true since we can define Q₁={(1, j):j≧I₁}. For i≧2, the description of the algorithm implies the following. If q_(i)=1, then we can put Q_(i)={(i, j):j≧I_(i)}. If q_(i)>1, then define Q_(i)=Q_(i−1)∪{(i, j):m_(i,j)=S_(i)}. Note that in this case, S_(i)=S_(i−1). The second part follows from the description of the algorithm. Namely, if q_(i)=1, then C_(i)=c_(i), S_(i)=s_(i), and c_(i)=d_(i,I_(i))≦f_(s_(i),i).

Fact 2.4.

1. Y_(l,i)≦f_(l),

2. If q_(i−1)=1, then Y_(l,i)≦f_(l,i−1)

Proof. Let (i, j) ε Q be such that d_(i,j)>f_(l); then Y_(l,(i,j))=1_((C_(i−1)≧d_(i,j)))1_((S_(i−1)=l))=1_((f_(l)≧C_(i−1)))1_((C_(i−1)≧d_(i,j)))1_((S_(i−1)=l)), where Fact 2.3 is used for the last equality. Thus, Y_(l,(i,j))=0. The definition of d_(i,j) implies |{j:(i,j)εQ,d_(i,j)≦f_(l)}|≦f_(l) for any fixed i and l. Thus,

$Y_{l,i} = {{\sum\limits_{{j \in {\lbrack t\rbrack}},{{({i,j})} \in Q}}Y_{l,{({i,j})}}} \leq {f_{l}.}}$

Part 2 follows by repeating the above arguments and using the second statement of Fact 2.3.

Definition 2.5. Let 1≦r₁≦r₂≦r and l ε [n]. Call a pair [r₁,r₂] an l-epoch if ∀i=r₁, . . . , r₂: S_(i)=l and q_(r₁)=q_(r₂+1)=1, and ∀i=r₁+1, . . . , r₂: q_(i)=q_(i−1)+1.

Lemma 2.6. Let [r₁,r₂] be an l-epoch. If r₂>r₁, then,

${r_{2} - r_{1}} \leq {\frac{1}{\lambda}{\sum\limits_{i = r_{1}}^{r_{2} - 1}{f_{l,i}.}}}$

Proof. First, observe that q_(r₂−1)=r₂−r₁. Second, q_(i)>1 implies that S_(i) is defined by eqn. 12 and not by eqn. 11 for all r₁<i≦r₂. In particular, C_(r₁)≦f_(l,r₁) and for r₁<i≦r₂ we have C_(i)=C_(i−1)+f_(l,i). Thus,

$C_{r_{2} - 1} \leq {\sum\limits_{i = r_{1}}^{r_{2} - 1}{f_{l,i}.}}$

Third, C_(r₂−1)≧λq_(r₂−1), since eqn. 10 must be false for i=r₂. Therefore,

${r_{2} - r_{1}} = q_{r_{2} - 1} \leq {\frac{1}{\lambda}C_{r_{2} - 1}} \leq {\frac{1}{\lambda}{\sum\limits_{i = r_{1}}^{r_{2} - 1}{f_{l,i}.}}}$

Lemma 2.7.

$Y_{l} \leq {\frac{f_{l}^{2}}{\lambda} + {f_{l}.}}$

Proof. Observe that the set {i:S_(i)=l} is a collection of disjoint l-epochs. Recall that Y_(l)=Σ_(i=2)^(r)Y_(l,i) and Y_(l,i) is non-zero only if S_(i−1) is equal to l. Thus, Y_(l) can be rewritten as

$Y_{l} = \sum\limits_{(r_{1},r_{2})\ \text{is an}\ l\text{-epoch}}\left( \sum\limits_{i = r_{1} + 1}^{r_{2} + 1}Y_{l,i} \right).$

For any epoch such that r₂>r₁, we have by Fact 2.4 and Lemma 2.6:

$\begin{matrix}{Y_{l} = \sum\limits_{(r_{1} < r_{2})\ \text{is an}\ l\text{-epoch}}\left( \sum\limits_{i = r_{1} + 1}^{r_{2} + 1}Y_{l,i} \right) + \sum\limits_{(r_{1} = r_{2})\ \text{is an}\ l\text{-epoch}}Y_{l,r_{2} + 1}} \\ {= \sum\limits_{(r_{1} < r_{2})\ \text{is an}\ l\text{-epoch}}\left( \sum\limits_{i = r_{1} + 1}^{r_{2}}Y_{l,i} \right) + \sum\limits_{(r_{1},r_{2})\ \text{is an}\ l\text{-epoch}}Y_{l,r_{2} + 1}} \\ {\leq \frac{f_{l}}{\lambda}\sum\limits_{(r_{1} < r_{2})\ \text{is an}\ l\text{-epoch}}\left( \sum\limits_{i = r_{1}}^{r_{2} - 1}f_{l,i} \right) + \sum\limits_{(r_{1},r_{2})\ \text{is an}\ l\text{-epoch}}f_{l,r_{2} + 1}} \\ {\leq \frac{f_{l}^{2}}{\lambda} + f_{l}.}\end{matrix}$

Lemma 2.8.

${P\left( {Y_{l} > 0} \right)} \leq {\frac{f_{l}}{t}.}$

Proof. Since I_(i) are independent and

$0 \leq \frac{f_{l,i}}{t} \leq 1$

we can apply Fact 2.10 as

${P\left( {\bigcap_{i = 1}^{r}\left( {m_{i,I_{i}} \neq l} \right)} \right)} = {{\prod\limits_{i = 1}^{r}\; \left( {1 - \frac{f_{l,i}}{t}} \right)} \geq {\left( {1 - \frac{f_{l}}{t}} \right).}}$

Thus:

$\begin{matrix}{{{P\left( {Y_{l} > 0} \right)}{P\left( {\bigcup_{i = 1}^{r}\left( {m_{i,I_{i}} = l} \right)} \right)}} \leq {\frac{f_{l}}{t}.}} & {{Eqn}.\mspace{14mu} 25}\end{matrix}$

Lemma 2.9. There exists an absolute constant α such that eqn. 13 impliesthat P(A_(i,j))≦0.01 for at least 0.99 f₁ pairs (i,j) ε Q.

Proof. From Lemmas 2.7 and 2.8,

${{E\left( Y_{l} \right)} \leq {\frac{f_{l}}{t}\left( {\frac{f_{l}^{2}}{\lambda} + f_{l}} \right)}},{{E(l)} = {{\sum\limits_{l = 2}^{n}{E\left( Y_{l} \right)}} \leq {\frac{G_{3}}{\lambda \; t} + {\frac{G_{2}}{t}.}}}}$

It follows that Σ_((i,j)⊂Q)1_(A) _(i,j)−Y. Recall that by eqn. 13,

${Q} = {f_{1} \geq {\alpha \left( {\frac{G_{3}}{\lambda \; t} + \frac{G_{2}}{t}} \right)} \geq {\alpha \; {{E\left( {\sum\limits_{{({i,j})} \in Q}^{\;}1_{{Ai},j}} \right)}.}}}$

Fact 2.11 implies that there exists an absolute constant α such that thelemma is true.

The following fact is well known. For completeness, the proof ispresented below.

Fact 2.10. Let α₁, . . . , α_(r) be real numbers in [0,1]. Then,

${\prod\limits_{i = {{1i} = 1}}^{r}\; \left( {1 - \alpha_{i}} \right)} \geq {1 - {\left( {\sum\limits^{r}\alpha_{i}} \right).}}$

Proof. If Σ_(i=1) ^(r)α_(l)≧1, then

${\prod\limits_{i = {{1i} = 1}}^{r}\; \left( {1 - \alpha_{i}} \right)} \geq 0 \geq {1 - {\left( {\sum\limits^{r}\alpha_{i}} \right).}}$

Thus, we can assume that Σ_(i=1) ^(r)α_(i)<1. This claim can be provenby induction on r. For r=2, we obtain(1−α₁)(1−α₂)=(1−α₁−α₂x+α₁α₂)≧(1−α₁−α₂). For r>2, we have, by induction,

${\prod\limits_{i = 1}^{r}\; \left( {1 - \alpha_{i}} \right)} \geq {\left( {1 - \left( {\sum\limits_{i = 1}^{r - 1}\alpha_{i}} \right)} \right)\left( {1 - \alpha_{r}} \right)} \geq {1 - {\left( {\overset{r}{\sum\limits_{i = 1}}\alpha_{i}} \right).}}$
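Fact 2.10 can be checked on concrete inputs with exact rational arithmetic. The sketch below is illustrative only; the function name `product_bound_holds` and the sample values are hypothetical.

```python
from fractions import Fraction
from math import prod


def product_bound_holds(alphas):
    """Check prod(1 - a_i) >= 1 - sum(a_i) for numbers a_i in [0, 1]."""
    lhs = prod(1 - a for a in alphas)
    rhs = 1 - sum(alphas)
    return lhs >= rhs


# Hypothetical inputs: one with a small sum and one whose sum exceeds 1.
print(product_bound_holds([Fraction(1, 10), Fraction(1, 5)]))
print(product_bound_holds([Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)]))
```

Using `Fraction` avoids floating-point rounding, so the comparison is exact even when the two sides are very close.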

Fact 2.11. Let X₁, . . . , X_(u) be a sequence of indicator random variables. Let S={i:P(X_(i)=1)≦v}. If E(Σ_(i=1)^(u)X_(i))≦μu then

$\left| S \right| \geq \left( 1 - \frac{\mu}{v} \right)u.$

Proof. Indeed,

$\mu u \geq \sum\limits_{i \notin S}P\left( X_{i} = 1 \right) \geq v\left( u - \left| S \right| \right).$

Events of type B.

For (i, j) ε Q let Z_((i,j))=1_(B_(i,j)). Let Z=Σ_((i,j)εQ)Z_((i,j)). We use arguments that are similar to the ones from the previous section. To stress the similarity, we abuse the notation and denote by Y_(l,h,(i,j)) the indicator of the event that h≧i+1, s_(h)=l and

$\left( {d_{i,j} + {\sum\limits_{u = {i + 1}}^{h - 1}f_{1,u}}} \right) < {c_{h}.}$

Define Y_(l,h)=Σ_((i,j)εQ)Y_(l,h,(i,j)), Y_(l)=Σ_(h=1)^(r)Y_(l,h).

Fact 2.12. Y_(l)≦f_(l).

Proof. Repeating the arguments from Fact 2.4, we have c_(h)1_((s_(h)=l))≦f_(l,h) and thus Y_(l,h)≦f_(l,h).

Fact 2.13.

${P\left( {Y_{l} > 0} \right)} \leq {\frac{f_{l}}{t}.}$

Proof. The proof is identical to the proof of Lemma 2.8.

Lemma 2.14. There exist absolute constants α,β such that eqn. 13 implies that P(B_(i,j))≦0.01 for at least 0.99 f₁ pairs (i, j) ε Q.

Proof. Denote Y=Σ_(l=1)^(n)Y_(l). It follows that Z≦Y and E(Z)≦E(Y). By Facts 2.12 and 2.13, it follows that

${E\left( Y_{l} \right)} \leq {\frac{f_{l}^{2}}{t}.}$

Thus, by eqn. 13,

${{E(Z)} \leq {E(Y)} \leq \frac{F_{2}}{t}} = {{\frac{G_{2}}{t} + {f_{1}\frac{f_{1}}{t}}} \leq {\left( {\alpha + \beta} \right){f_{1}.}}}$

We repeat the arguments from Lemma 2.9.

Events of type H

Definition 2.15.

Let U={u₁, . . . , u_(t)} and W={w₁, . . . , w_(t)} be two sequences of non-negative integers. Let (i,j) be a pair such that 1≦i≦t and 1≦j≦u_(i). Denote (i,j) as a losing pair (with respect to sequences U,W) if there exists h, i≦h≦t, such that

${{- j} + {\sum\limits_{s = i}^{h}\left( {u_{s} - w_{s}} \right)}} < 0.$

Denote any pair that is not a losing pair as a winning pair.
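Definition 2.15 can be sketched as a direct test on a pair of sequences. The function names `is_losing_pair` and `winning_pairs` are hypothetical; the indices i and j are 1-based as in the definition, while the sequences are ordinary Python lists.

```python
def is_losing_pair(U, W, i, j):
    """True if -j + sum_{s=i..h} (u_s - w_s) < 0 for some h with i <= h <= t."""
    running = 0
    for h in range(i, len(U) + 1):
        running += U[h - 1] - W[h - 1]
        if running - j < 0:
            return True
    return False


def winning_pairs(U, W):
    """List every pair (i, j), 1 <= j <= u_i, that is not a losing pair."""
    return [
        (i, j)
        for i in range(1, len(U) + 1)
        for j in range(1, U[i - 1] + 1)
        if not is_losing_pair(U, W, i, j)
    ]


# Hypothetical sequences.
print(winning_pairs([3], [1]))
print(winning_pairs([1, 3], [2, 0]))
```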

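Fact 2.16. Consider the pair of sequences (U,W) with u_(i)=f_(1,i) and w_(i)=λ for i ε [r]. If (i,j) is a winning pair with respect to (U,W), then H_(i,j′) does not occur, where j′ is such that m_(i,j′)=1 and d_(i,j′)=f_(1,i)−j+1.

Proof. By Definition 2.15, for every i≦h≦r: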

$\begin{matrix}{{{- j} + {\sum\limits_{l = i}^{h}u_{l}}} \geq {\sum\limits_{l = i}^{h}{w_{l}.}}} & {{Eqn}.\mspace{14mu} 26}\end{matrix}$

Since Σ_(l=i)^(h)w_(l)=(h−i+1)λ and d_(i,j′)=f_(1,i)−j+1, for every i≦h≦r,

$d_{i,j^{\prime}} + \sum\limits_{l = i + 1}^{h}f_{1,l} = - j + 1 + \sum\limits_{l = i}^{h}f_{1,l} = - j + 1 + \sum\limits_{l = i}^{h}u_{l} \geq - j + \sum\limits_{l = i}^{h}u_{l} \geq \sum\limits_{l = i}^{h}w_{l} = \left( h - i + 1 \right)\lambda.$

Substitute h by h−1 (for h>i), such that

$d_{i,j^{\prime}} + \sum\limits_{l = i + 1}^{h - 1}f_{1,l} \geq \left( h - i \right)\lambda.$

Thus, H_(i,j′) does not occur, by eqn. 18.

Lemma 2.17. There exists an absolute constant α, such that eqn. 13 implies that H_(i,j) does not occur for at least 0.99 f₁ pairs (i, j) ε Q.

Proof. By Lemma 2.20, there exist at least

$\sum\limits_{i = 1}^{r}\left( {u_{i} - w_{i}} \right)$

winning pairs (i,j) with respect to the (U,W). Also, Σ_(i=1)^(r)u_(i)=Σ_(i=1)^(r)f_(1,i)=f₁ and Σ_(i=1)^(r)w_(i)=λr. Thus, there exist at least f₁−λr winning pairs (i, j) with respect to the (U,W). In the statement of Fact 2.16, the mapping from j to j′ is a bijection. Thus, there exist at least f₁−λr pairs (i,j′) such that m_(i,j′)=1 and H_(i,j′) does not occur. By eqn. 13, we have f₁≧αλr and the lemma follows.

Definition 2.18. Let U={u₁, . . . , u_(t)} and W={w₁, . . . , w_(t)} be two sequences of non-negative integers. Let 1≦h<t. Let U′,W′ be two sequences of size t−h defined by u′_(i)=u_(i+h), w′_(i)=w_(i+h) for i=1, . . . , t−h. Denote U′,W′ as the h-tail of the sequences U,W.

Fact 2.19. If (i, j) is a winning pair with respect to the h-tail of U,W, then (i+h, j) is a winning pair with respect to U,W. If (i,j) is a winning pair with respect to h-tail of U,W then (i, j) is a winning pair with respect to U,W.

Proof. Follows directly from Definitions 2.15 and 2.18.

Lemma 2.20. If Σ_(s=1)^(t)(u_(s)−w_(s))>0, then there exist at least Σ_(s=1)^(t)(u_(s)−w_(s)) winning pairs.

Proof. We use induction on t. For t=1, any pair (1, j) is winning if 1≦j≦u₁−w₁. Consider t>1 and apply the following case analysis.

1. Assume that there exists 1≦h<t, such that Σ_(s=1)^(h)(u_(s)−w_(s))≦0. Consider the h-tail of U,W. By induction and by Fact 2.19, there exist at least Σ_(s=h+1)^(t)(u_(s)−w_(s))≧Σ_(s=1)^(t)(u_(s)−w_(s)) winning pairs with respect to U,W.

2. Assume that (1,u₁) is a winning pair. It follows that (1, j), j<u₁, is a winning pair as well. If Σ_(s=2)^(t)(u_(s)−w_(s))>0, then, by induction and by Fact 2.19, there exist at least Σ_(s=2)^(t)(u_(s)−w_(s)) winning pairs of the form (i,j) where i>1. In total there are u₁+Σ_(s=2)^(t)(u_(s)−w_(s))≧Σ_(s=1)^(t)(u_(s)−w_(s)) winning pairs with respect to U,W. The case when Σ_(s=2)^(t)(u_(s)−w_(s))≦0 is trivial.

3. Assume that cases 1 and 2 do not hold. Then, u₁>0. Indeed, otherwise u₁−w₁≦0 and, thus, case 1 is true. Also, (1,1) is a winning pair. Indeed, otherwise there exists 1≦h<t such that −1+Σ_(i=1)^(h)(u_(i)−w_(i))<0. All numbers are integers. Thus, Σ_(i=1)^(h)(u_(i)−w_(i))≦0 and case 1 is true. Thus, (1,1) is a winning pair and (1,u₁) is not a winning pair (since case 2 does not hold). Therefore, there exists 1<u≦u₁, such that (1,u−1) is a winning pair and (1,u) is not a winning pair. In particular, there exists 1≦h<t, such that

${{- u} + {\sum\limits_{s = 1}^{h}\left( {u_{s} - w_{s}} \right)}} < 0.$

On the other hand, (1,u−1) is a winning pair, thus,

$0 \leq {1 - u + {\sum\limits_{s = 1}^{h}{\left( {u_{s} - w_{s}} \right).}}}$

All numbers are integers and, thus, it can be concluded that

${\sum\limits_{s = 1}^{h}\left( {u_{s} - w_{s}} \right)} = {u - 1.}$

Consider the h-tail of U,W. By induction, there exists at least

${\sum\limits_{i = {h + 1}}^{t}\left( {u_{i} - w_{i}} \right)} = {{\sum\limits_{i = 1}^{t}\left( {u_{i} - w_{i}} \right)} - \left( {u - 1} \right)}$

winning pairs with respect to the h-tail of U,W of the form (i, j), where i>1. By properties of u, there exist additional (u−1) winning pairs of the form (1, j), j≦u−1. Summing up, the lemma is obtained.

3. The Streaming Algorithm

Fact 3.1. Let v₁, . . . , v_(n) be a sequence of non-negative numbers and let k>2. Then,

$\left( {\sum\limits_{i = 1}^{n}v_{i}^{2}} \right)^{({k - 1})} \leq {\left( {\sum\limits_{i = 1}^{n}v_{i}^{k}} \right){\left( {\sum\limits_{i = 1}^{n}v_{i}} \right)^{({k - 2})}.}}$

Proof. Define

$\lambda_{i} = {\frac{v_{i}}{\sum\limits_{j = 1}^{n}v_{j}}.}$

Since g(x)=x^(k−1) is convex on the interval [0, ∞), we can apply Jensen's inequality and obtain

$\left( \frac{\sum\limits_{i = 1}^{n}v_{i}^{2}}{\sum\limits_{i = 1}^{n}v_{i}} \right)^{k - 1} = {{\left( {\sum\limits_{i = 1}^{n}{\lambda_{i}v_{i}}} \right)^{({k - 1})} \leq \left( {\sum\limits_{i = 1}^{n}{\lambda_{i}v_{i}^{({k - 1})}}} \right)} = {\frac{\sum\limits_{i = 1}^{n}v_{i}^{k}}{\sum\limits_{i = 1}^{n}v_{i}}.}}$
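The inequality of Fact 3.1 can be spot-checked numerically. The following is a minimal sketch, not part of the described method; the function name fact_3_1_holds and the tolerance parameter are hypothetical and are introduced only for illustration.

```python
def fact_3_1_holds(v, k, rel_tol=1e-9):
    """Check (sum v_i^2)^(k-1) <= (sum v_i^k) * (sum v_i)^(k-2) for one vector.

    v is a sequence of non-negative numbers and k > 2, as in Fact 3.1.
    rel_tol is a hypothetical slack that absorbs floating-point rounding.
    """
    if k <= 2:
        raise ValueError("Fact 3.1 is stated for k > 2")
    if any(x < 0 for x in v):
        raise ValueError("entries must be non-negative")
    s1 = sum(v)                      # sum of v_i
    s2 = sum(x * x for x in v)       # sum of v_i^2
    sk = sum(x ** k for x in v)      # sum of v_i^k
    lhs = s2 ** (k - 1)
    rhs = sk * s1 ** (k - 2)
    return lhs <= rhs * (1 + rel_tol)
```

For a constant vector both sides coincide, which matches the equality case of Jensen's inequality used in the proof.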

Let D be a stream. Define:

$\begin{matrix}{{\psi = \frac{n^{1 - {({1/k})}}G_{k}^{1/k}}{F_{1}}},{\delta = 2^{\left\lceil {0.5\; {\log_{2}(\psi)}} \right\rceil}},{t = \left\lceil \frac{\delta \; F_{1}}{n^{1/k}} \right\rceil},{{\lambda = \left\lceil \frac{F_{1}\delta^{3}}{n} \right\rceil};}} & {{Eqn}.\mspace{14mu} 27}\end{matrix}$

where eqn. 2 was used to define F_(k). We will make the following assumptions:

f₁≦0.1F₁, t≦F₁, F₁ (mod t)=0   Eqn. 28.

Then, it is possible to define an r×t matrix M, where r=F₁/t and with entries m_(i,j)=p_(ir+j).
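The parameters of eqn. 27 can be computed directly once n, k, F₁, and G_(k) are known. The following is a minimal sketch under stated assumptions: δ is read as 2 raised to the power ⌈0.5 log₂(ψ)⌉ (the reading under which Fact 3.2 gives δ≧1), G_(k) is passed in as a number because its definition lies outside this passage, and the function name pd_parameters is hypothetical.

```python
import math


def pd_parameters(n, k, F1, Gk):
    """Return (psi, delta, t, lam) following eqn. 27.

    Assumptions (not confirmed by this passage):
      * delta = 2 ** ceil(0.5 * log2(psi))
      * Gk is supplied by the caller as a positive number
    """
    if n <= 0 or k <= 2 or F1 <= 0 or Gk <= 0:
        raise ValueError("expected n > 0, k > 2, F1 > 0, Gk > 0")
    psi = n ** (1 - 1 / k) * Gk ** (1 / k) / F1
    delta = 2 ** math.ceil(0.5 * math.log2(psi))
    t = math.ceil(delta * F1 / n ** (1 / k))    # row length of the matrix M
    lam = math.ceil(F1 * delta ** 3 / n)        # threshold multiplier lambda
    return psi, delta, t, lam
```

The number of rows then follows as r=F₁/t whenever the divisibility assumption of eqn. 28 holds.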

Fact 3.2. 1≦δ≦2n^((k−1)/2k).

Proof. Indeed, G₁≦G_(k) ^(1/k)n^(1−1/k) by Hölder's inequality and, since f₁≦0.1F₁ by eqn. 28, we have ψ≧0.5. Thus, ⌈0.5 log₂(ψ)⌉≧0 and the lower bound follows. Also, F_(k) ^(1/k) is the L_(k) norm of the frequency vector since all frequencies are non-negative. Since L_(k)≦L₁, we conclude that ψ≦n^(1−1/k) and the fact follows.

Observe that there exists a frequency vector with δ=O(1): put f_(i)=1 for all i ∈ [n]. At the same time there exists a vector with δ=Ω(n^((k−1)/2k)): put f₁=n and f_(j)=1 for j≧2. It is not hard to see that if δ is sufficiently large then a naive sampling method will find a heavy element. For example, in the latter case, the heavy element occupies half of the stream.

Fact 3.3. λr≦4G_(k) ^(1/k).

Proof. Recall that F₁=rt. The fact follows from the definitions of λ and t.
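The step from the definitions to the bound can be spelled out as follows. This is a sketch that ignores the ceilings in eqn. 27 and reads δ as 2 raised to the power ⌈0.5 log₂(ψ)⌉, so that δ≦2ψ^(1/2) and hence δ²≦4ψ:

$\lambda r = \frac{\lambda F_{1}}{t} \approx \frac{F_{1}\delta^{3}}{n} \cdot \frac{n^{1/k}}{\delta} = \frac{F_{1}\delta^{2}}{n^{1 - (1/k)}} \leq \frac{4\psi F_{1}}{n^{1 - (1/k)}} = 4G_{k}^{1/k}.$

The last equality uses only the definition of ψ in eqn. 27.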

Fact 3.4.

$\frac{G_{2}}{t} \leq {G_{k}^{1/k}.}$

Proof. Define

$\alpha = {\frac{k - 3}{2\left( {k - 2} \right)}.}$

By Hölder's inequality, we have:

$\begin{matrix}{{G_{2}^{\alpha} \leq {G_{k}^{\frac{2\alpha}{k}}n^{\alpha {({1 - \frac{2}{k}})}}}} = {G_{k}^{\frac{k - 3}{k{({k - 2})}}}{n^{\frac{k - 3}{2k}}.}}} & {{Eqn}.\mspace{14mu} 29}\end{matrix}$

Also, by Fact 3.1:

$\begin{matrix}{G_{2}^{1 - \alpha} = {G_{2}^{\frac{k - 1}{2{({k - 2})}}} \leq {G_{k}^{\frac{1}{2{({k - 2})}}}{G_{1}^{\frac{1}{2}}.}}}} & {{Eqn}.\mspace{14mu} 30}\end{matrix}$

Thus,

${G_{2} \leq {G_{k}^{\frac{k - 3}{k{({k - 2})}}}n^{\frac{k - 3}{2k}}G_{k}^{\frac{1}{2{({k - 2})}}}F_{1}^{\frac{1}{2}}}} = {{G_{k}^{\frac{1}{k}}\frac{F_{1}}{n^{1/k}}\left( \frac{G_{k}^{\frac{1}{k}}n^{\frac{k - 1}{k}}}{F_{1}} \right)^{1/2}} \leq {t\; {G_{k}^{\frac{1}{k}}.}}}$

Fact 3.5.

$\frac{G_{3}}{\lambda \; t} \leq {G_{k}^{1/k}.}$

Proof. By Hölder's inequality:

G₃≦G_(k) ^(3/k)n^(1−(3/k))   Eqn. 31.

Thus,

$\frac{G_{3}}{\lambda \; t} = {\frac{n^{1 + {({1/k})}}G_{3}}{F_{1}^{2}\delta^{4}} \leq \frac{n^{2 - {({2/k})}}G_{k}^{3/k}}{F_{1}^{2}\delta^{4}} \leq {G_{k}^{1/k}.}}$

Theorem 3.6. Let M be an r×t matrix, such that eqn. 27 is true. Then, there exist absolute constants α,β such that:

αG_(k) ^(1/k)≦f₁≦βt   Eqn. 32;

imply:

$\begin{matrix}{{P\left( {S_{r} = 1} \right)} \geq {\frac{\delta}{2\; n^{1 - {({2/k})}}}.}} & {{Eqn}.\mspace{14mu} 33}\end{matrix}$

Proof. By eqn. 32 and Facts 3.3, 3.4, and 3.5,

${6\; {\alpha\left( {{\lambda \; r} + \frac{G_{3}}{\lambda \; t} + \frac{G_{2}}{t}} \right)}} \leq f_{1} \leq {\beta \; {t.}}$

Also, eqn. 27 implies

${f_{1}/t} \geq {\frac{\delta}{n^{1 - {({2/k})}}}.}$

Thus, eqn. 33 follows from Theorem 2.1.

This describes an exemplary implementation of pick-and-drop sampling in accordance with the present invention, which can be represented in exemplary code as follows:

Algorithm 1 P&D (M, r, t, λ)
  Generate independent identically distributed random variables {I_(j)}_(j=1) ^(r) with uniform distribution on [t].
  S₁ = m_(1,I₁), C₁ = d_(1,I₁), q₁ = 1.
  for i = 2 → r do
    if (C_(i−1) < max{λq_(i−1), c_(i)}) then
      S_(i) = s_(i), C_(i) = c_(i), q_(i) = 1
    else
      S_(i) = S_(i−1), C_(i) = C_(i−1) + f_(i,S_(i)), q_(i) = q_(i−1) + 1
    end if
  end for
  Output (S_(r), C_(r))
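Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: s_(i) is taken to be the element at the random position I_(i) of row i, c_(i) is taken to be the number of occurrences of s_(i) in the suffix of row i that starts at I_(i) (the local counter), and the else-branch is read as adding the frequency of the kept sample in row i to the global counter. The function name pick_and_drop and the rng parameter are hypothetical, and each row is sliced from a list for clarity instead of being consumed in a true single pass.

```python
import random


def pick_and_drop(stream, t, lam, rng=random):
    """Return one pick-and-drop sample (S_r, C_r) for a stream of r*t items.

    stream: list of elements, read row by row as an r x t matrix
    t:      row length; lam: the threshold multiplier lambda
    rng:    any object with randrange(), e.g. random.Random(seed)
    """
    if t <= 0 or len(stream) == 0 or len(stream) % t != 0:
        raise ValueError("stream length must be a positive multiple of t")
    r = len(stream) // t
    sample, counter, age = None, 0, 0        # S_i, C_i, q_i
    for i in range(r):
        row = stream[i * t:(i + 1) * t]
        pos = rng.randrange(t)               # I_i, uniform over the row
        s_i = row[pos]                       # local sample
        c_i = row[pos:].count(s_i)           # local counter over the suffix
        if i == 0 or counter < max(lam * age, c_i):
            # Drop the global sample and pick the local one.
            sample, counter, age = s_i, c_i, 1
        else:
            # Keep the global sample and add its frequency in this row.
            counter += row.count(sample)
            age += 1
    return sample, counter
```

Because the counter starts from a suffix count and afterwards only adds row frequencies of the same element, the returned count never exceeds the true frequency of the returned element in this sketch.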

Theorem 3.7. Call an element i heavy if f_(i) ^(k)>100Σ_(j≠i)f_(j) ^(k). There exists a (constructive) algorithm that makes one pass over the stream and uses O(n^(1−2/k) log(n)) bits. The algorithm outputs a pair (i, f̃_(i)), such that f̃_(i)≦f_(i) with probability 1. If there exists a heavy element i, then also with constant probability, the algorithm will output (i, f̃_(i)), such that (1−ε)f_(i)≦f̃_(i).

Proof. Define t as in eqn. 27. Without loss of generality, we can assume F₁ is divisible by t. Note that if t>F₁ or f₁≧0.1F₁, then it is possible to find a heavy element with O(n^(1−2/k)) bits by existing methods such as in Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In ICALP '02: Proceedings of the 29th International Colloquium on Automata, Languages and Programming, pages 693-703, London, UK, 2002. Springer-Verlag, which is incorporated herein by reference. Otherwise, a stream D defines a matrix M for which we compute O(n^(1−2/k)/εδ) independent pick-and-drop samples. Since we do not know the value of δ, we should repeat the experiment for all possible values of δ. Output the element with the maximum frequency. With constant probability, the output of the pick-and-drop sampling will include a (1−ε) approximation of the frequency f_(i). Thus, there will be no other f_(j) that can give a larger approximation and replace a heavy element. The total space will define a geometric series that sums to O(n^(1−2/k) log(n)).
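The geometric series mentioned in the proof can be made concrete with a small sketch. It assumes that the candidate values of δ are the powers of two between 1 and the upper bound of Fact 3.2, that the sample count for each candidate is n^(1−2/k)/(εδ) with the hidden constant taken as 1, and that each sample costs O(log(n)) bits. The function name sample_schedule is hypothetical.

```python
import math


def sample_schedule(n, k, eps):
    """List (delta, number_of_samples) for every candidate delta.

    Assumptions: delta ranges over powers of two up to 2*n^((k-1)/(2k)),
    and the constant hidden in O(n^(1-2/k)/(eps*delta)) is taken to be 1.
    """
    if n <= 0 or k <= 2 or not (0 < eps < 1):
        raise ValueError("expected n > 0, k > 2, 0 < eps < 1")
    base = n ** (1 - 2 / k) / eps                  # samples needed when delta = 1
    max_delta = 2 * n ** ((k - 1) / (2 * k))       # upper bound from Fact 3.2
    schedule = []
    delta = 1
    while delta <= max_delta:
        schedule.append((delta, math.ceil(base / delta)))
        delta *= 2
    return schedule
```

Since the counts halve as δ doubles, their sum stays below twice the first term plus one unit of rounding per candidate, so trying every candidate δ costs only a constant factor more samples than knowing δ in advance.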

If we know F₁ ahead of time then we can compute the value of t for any possible δ and thus solve the problem in one pass. However, one can show that the well-known doubling technique (when we double our parameter t each time the size of the stream doubles) will work in our case and, thus, one pass is sufficient even without knowing F₁.

Previously, such as described in Vladimir Braverman and Rafail Ostrovsky. Recursive sketching for frequency moments. CoRR, abs/1011.2571, 2010, which is incorporated herein by reference, we developed a method of recursive sketches with the following property: Given an algorithm that finds a heavy element and uses memory μ(n), it is possible to solve the frequency moment problem in space O(μ(n)log^((c))(n)). In this previous work, we applied recursive sketches with the method of Charikar et al. cited above. Thus, we can replace the method from Charikar et al. with Theorem 3.7 and obtain:

Theorem 3.8. Let ε and k be constants. There exists a (constructive) algorithm that computes a (1±ε)-approximation of F_(k), uses O(n^(1−2/k) log(n)log^((c))(n)) memory bits, makes one pass, and errs with probability at most ⅓.

The above analysis focuses on the case when 1 is a “heavy element,” but it is possible to repeat the arguments for any i. The above Theorem 2.1 claims that 1 will be outputted with probability

$\Omega \left( \frac{f_{1}}{t} \right)$

for sufficiently large f₁. Notably, Theorem 2.1 holds for arbitrary distributions of frequencies. Furthermore, as addressed with respect to Theorem 3.6, there exist r,t,λ such that a bound similar to eqn. 3 holds. This new method can be combined with other methods to obtain, for example, Theorem 3.8.

Thus, the pick-and-drop sampling method samples a heavy element, such as an element i with frequency Ω(F_(k)), with probability Ω(1/n^(1−2/k)) and gives approximation f̃_(i)≧(1−ε)f_(i). In addition, the estimations never exceed the real values, that is f̃_(i)≦f_(i) for all i. As a result, the space complexity of finding a heavy element can be reduced to O(n^(1−2/k) log(n)) bits. Recursive sketches can be used to resolve the problem with O(n^(1−2/k) log(n)log^((c))(n)) bits. Advantageously, optimizing the space complexity as a function of ε can be avoided.

The present invention has been described in terms of one or more preferred embodiments, and it should be appreciated that many equivalents, alternatives, variations, and modifications, aside from those expressly stated, are possible and within the scope of the invention.

1. A database system comprising: a database; a database server configured to control reading data from and writing data to the database; an input to the database server configured to deliver a data stream formed of a sequence of elements, D={p₁, p₂, . . . , p_(m)} of size m of numbers from {1, . . . , n} to the database server; a non-transitive, computer-readable storage medium, having stored thereon, a computer program that, when executed by a processor, causes the processor to approximate frequency moments (F_(k)) in the data stream, such that a frequency of an element (i) is defined as f_(i)=|{j:p_(j)=i}| and a k-th frequency moment of D is defined as $F_{k} = {\sum\limits_{i = 1}^{n}m_{i}^{k}}$ in a single pass through the data stream by the steps of: (a) arranging a portion of the data stream in a matrix; (b) selecting an initial element in the matrix; (c) checking the matrix for a duplicate of the initial element; (d) upon identifying a duplicate of the initial element in the matrix, assuming that the initial element appears in each row of the matrix, assigning binary values to all other frequencies, and disregarding the initial element; (e) upon completing step (c) without identifying a duplicate of the initial element, assigning a binary value to all frequencies; (f) repeating steps (b) through (e) for each subsequent element in the matrix; and (g) generating a report of approximated frequency moments in the data stream.
 2. The database system of claim 1 wherein the processor is further caused to implement a local counter to count a number of times an element appears in a suffix of a row in the matrix.
 3. The database system of claim 2 wherein the processor is further caused to implement a global counter incremented as a function of the local counter.
 4. The database system of claim 3 wherein the processor is further caused to drop and re-initiate the global counter if the local counter exceeds the global counter.
 5. The database system of claim 1 wherein the processor is further caused to approximate f_(i) as ≧ a fraction of f_(i) to limit memory resources used by the processor to estimate F_(k) to O(n^(1−2/k) log(n)) bits.
 6. A method for approximating frequency moments (F_(k)) in data streams formed of a sequence of elements, D={p₁, p₂, . . . , p_(m)} of size m of numbers from {1, . . . , n}, such that a frequency of an element (i) is defined as f_(i)=|{j:p_(j)=i}| and a k-th frequency moment of D is defined as ${F_{k} = {\sum\limits_{i = 1}^{n}m_{i}^{k}}},$ the method comprising the steps of: (a) arranging a portion of the data stream in a matrix; (b) selecting an initial element in the matrix; (c) checking the matrix for a duplicate of the initial element; (d) upon identifying a duplicate of the initial element in the matrix, assuming that the initial element appears in each row of the matrix, assigning binary values to all other frequencies, and disregarding the initial element; (e) upon completing step (c) without identifying a duplicate of the initial element, assigning a binary value to all frequencies; (f) repeating steps (b) through (e) for each subsequent element in the matrix; and (g) generating a report of approximated heavy elements in the data stream.
 7. The method of claim 6 further comprising implementing a local counter to count a number of times an element appears in a suffix of a row in the matrix.
 8. The method of claim 7 further comprising implementing a global counter incremented as a function of the local counter.
 9. The method of claim 8 further comprising dropping and re-initiating the global counter if the local counter exceeds the global counter.
 10. The method of claim 6 further comprising approximating f_(i) as ≧ a fraction of f_(i) to limit memory resources used by the processor to estimate F_(k) to O(n^(1−2/k) log(n)) bits.
 11. The method of claim 6 further comprising limiting a degree of frequency moment (k) to greater than 2.
 12. A database system comprising: a database; a database server configured to control reading data from and writing data to the database; an input to the database server configured to deliver a data stream formed of a sequence of elements, D={p₁, p₂, . . . , p_(m)} of size m of numbers from {1, . . . , n} to the database server; a non-transitive, computer-readable storage medium, having stored thereon, a computer program that, when executed by a processor, causes the processor to approximate frequency moments (F_(k)) in the data stream, such that a frequency of an element (i) is defined as f_(i)=|{j:p_(j)=i}| and a k-th frequency moment of D is defined as $F_{k} = {\sum\limits_{i = 1}^{n}m_{i}^{k}}$ in a single pass through the data stream by: locating elements (i) with a frequency ΩF_(k) in the data stream as heavy elements; approximating f_(i) as ≧ a fraction of f_(i) to limit memory resources used by the processor to estimate F_(k) to O(n^(1−2/k) log(n)) bits.
 13. The database system of claim 12 wherein the processor is further caused to limit a degree of frequency moment (k) to greater than 2.
 14. The database system of claim 12 wherein the processor is further caused to arrange a portion of the data stream in a matrix, select an initial element in the matrix, and check the matrix for a duplicate of the initial element.
 15. The database system of claim 14 wherein the processor is further caused to, upon identifying a duplicate of the initial element in the matrix, assume that the initial element appears in each row of the matrix, assign binary values to all other frequencies, and disregard the initial element.
 16. The database system of claim 14 wherein the processor is further caused to, upon completing checking the matrix without identifying a duplicate of the initial element, assign a binary value to all frequencies.
 17. The database system of claim 14 wherein the processor is further caused to analyze each subsequent element in the matrix by checking the matrix for a duplicate of each subsequent element.
 18. The database system of claim 12 wherein the processor is further caused to generate a report of approximated frequency moments in the data stream.